June 2024

Scraping, Cleaning & Loading Pipeline

Automated collection of 8,000+ records from 9 PSL seasons (2016–2024)

Project Summary

This project showcases an end-to-end ETL (Extract, Transform, Load) pipeline built with Python, using Selenium for dynamic web scraping. It collects match-level player statistics from ESPNcricinfo, cleans the data with Pandas, and stores the final structured records in AWS RDS for further analysis, demonstrating real-world data engineering and automation throughout.

Sample Results

Web to Table: Screenshots show scraped pages from ESPNcricinfo and the final cleaned datasets.


View Complete Dataset on Kaggle

Pipeline Architecture

Overview: A multi-threaded scraper automates Microsoft Edge to navigate PSL match pages and extract their dynamically rendered HTML. A thread pool scrapes multiple matches in parallel; BeautifulSoup parses the HTML, Pandas cleans the data, and SQLAlchemy loads it securely into AWS RDS (MySQL).

1. Extraction

  • Selenium
  • ThreadPoolExecutor
  • BeautifulSoup
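The extraction stage can be sketched as follows. This is a minimal illustration of the parallel parse pattern: static HTML strings stand in for `driver.page_source` from the Selenium Edge driver, and the page structure, match IDs, and `parse_match` helper are simplified assumptions, not the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

# Static HTML stands in for what the Selenium Edge driver would return
# via driver.page_source (sample data for illustration only).
SAMPLE_PAGES = {
    "match-1": "<table><tr><td>Babar Azam</td><td>68</td></tr></table>",
    "match-2": "<table><tr><td>Shadab Khan</td><td>42</td></tr></table>",
}

def parse_match(match_id: str, html: str) -> dict:
    """Parse one simplified scorecard table into a flat record."""
    soup = BeautifulSoup(html, "html.parser")
    cells = [td.get_text() for td in soup.find_all("td")]
    return {"match_id": match_id, "player": cells[0], "runs": int(cells[1])}

def scrape_all(pages: dict) -> list:
    # Process matches in parallel, mirroring the multi-threaded design.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(parse_match, mid, html)
                   for mid, html in pages.items()]
        return [f.result() for f in futures]

records = scrape_all(SAMPLE_PAGES)
```

In the real pipeline each worker would drive its own browser instance and hand the rendered HTML to the same kind of parsing function.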

2. Transformation

  • Pandas DataFrames
  • CSV Storage
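A minimal sketch of the cleaning step, assuming the kinds of defects typical of scraped scorecards (duplicate rows, non-numeric placeholders); the column names and sample values are illustrative, not the project's actual schema.

```python
import io
import pandas as pd

# Raw scraped rows often contain duplicates and placeholder values
# such as "-" (illustrative sample, not the real dataset).
raw_csv = io.StringIO(
    "player,runs,season\n"
    "Babar Azam,68,2021\n"
    "Babar Azam,68,2021\n"
    "Shadab Khan,-,2021\n"
)
df = pd.read_csv(raw_csv)

# Coerce runs to numeric ("-" becomes NaN), then drop duplicates
# and rows that failed conversion.
df["runs"] = pd.to_numeric(df["runs"], errors="coerce")
clean = df.drop_duplicates().dropna(subset=["runs"]).copy()
clean["runs"] = clean["runs"].astype(int)

# Persist the cleaned frame as a CSV checkpoint (buffer here; a file on disk
# in the real pipeline).
buf = io.StringIO()
clean.to_csv(buf, index=False)
```

The CSV checkpoint decouples scraping from loading, so a failed upload never forces a re-scrape.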

3. Loading

  • SQLAlchemy ORM
  • AWS RDS MySQL
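The load step can be sketched with SQLAlchemy and `DataFrame.to_sql`. An in-memory SQLite engine stands in for the AWS RDS MySQL endpoint so the sketch runs anywhere; the connection URI in the comment and the table name are assumptions, not the project's actual configuration.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In production the engine would point at AWS RDS MySQL, e.g.
#   create_engine("mysql+pymysql://user:pass@<rds-endpoint>/psl")  # hypothetical URI
# SQLite in-memory stands in here so the example is self-contained.
engine = create_engine("sqlite:///:memory:")

stats = pd.DataFrame({
    "player": ["Babar Azam", "Shadab Khan"],
    "runs": [68, 42],
    "season": [2021, 2021],
})

# Bulk-load the cleaned DataFrame into a table, replacing any prior load.
stats.to_sql("player_stats", engine, if_exists="replace", index=False)

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM player_stats")).scalar()
```

Swapping the connection string is the only change needed to target the cloud database, since SQLAlchemy abstracts the dialect.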

Key Achievements

Metric           | Result            | Impact
Collection Speed | < 15 minutes      | 75% faster than sequential scraping
Data Validity    | 99% valid records | Reliable analytics base
Storage          | AWS RDS           | Cloud-ready complete data (2016–2024)

What I Learned: Hands-on experience with Selenium, concurrency, data cleaning, and cloud databases, including how to optimize scraping at scale and parse inconsistent HTML structures.

Applications

  • Data dashboards for player trends
  • ML models for performance prediction
  • EDA and cricket analytics
  • Fantasy league tools and blog content